Expressive Talking Head Video Encoding in StyleGAN2 Latent-Space
While the recent advances in research on video reenactment have yielded
promising results, the approaches fall short in capturing the fine, detailed,
and expressive facial features (e.g., lip-pressing, mouth puckering, mouth
gaping, and wrinkles) which are crucial in generating realistic animated face
videos. To this end, we propose an end-to-end expressive face video encoding
approach that facilitates data-efficient high-quality video re-synthesis by
optimizing low-dimensional edits of a single Identity-latent. The approach
builds on StyleGAN2 image inversion and multi-stage non-linear latent-space
editing to generate videos that are nearly comparable to input videos. While
existing StyleGAN latent-based editing techniques focus on simply generating
plausible edits of static images, we automate the latent-space editing to
capture the fine expressive facial deformations in a sequence of frames using
an encoding that resides in the Style-latent-space (StyleSpace) of StyleGAN2.
The encoding thus obtained could be superimposed on a single Identity-latent
to facilitate re-enactment of face videos. The proposed framework
economically captures face identity, head-pose, and complex expressive facial
motions at fine levels, and thereby bypasses training, person modeling,
dependence on landmarks/keypoints, and low-resolution synthesis which tend to
hamper most re-enactment approaches. The approach is designed with maximum data
efficiency, where a single latent and 35 parameters per frame enable
high-fidelity video rendering. This pipeline can also be used for puppeteering
(i.e., motion transfer). Comment: The project page is located at
https://trevineoorloff.github.io/ExpressiveFaceVideoEncoding.io
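The data-efficiency claim above (one identity latent plus 35 parameters per frame) can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the latent shapes, the fixed linear basis mapping 35 parameters to StyleSpace offsets, and all function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

STYLE_DIM = 512          # assumed StyleSpace channel width
N_PARAMS = 35            # per-frame parameter count reported in the abstract

# Hypothetical fixed mapping from the 35 parameters to a StyleSpace offset.
basis = rng.standard_normal((N_PARAMS, STYLE_DIM))

def frame_style(identity_latent, frame_params):
    """Superimpose a frame's low-dimensional edit onto the identity latent."""
    offset = frame_params @ basis          # (35,) -> (512,) offset
    return identity_latent + offset

identity = rng.standard_normal(STYLE_DIM)            # stored once per video
video_params = rng.standard_normal((120, N_PARAMS))  # 120 frames x 35 params

frames = np.stack([frame_style(identity, p) for p in video_params])
print(frames.shape)  # one style vector per frame
```

The storage cost per frame is only the 35-vector; the identity latent and basis are shared across the whole video.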
One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2
While recent research has progressively overcome the low-resolution
constraint of one-shot face video re-enactment with the help of StyleGAN's
high-fidelity portrait generation, these approaches rely on at least one of the
following: explicit 2D/3D priors, optical flow based warping as motion
descriptors, off-the-shelf encoders, etc., which constrain their performance
(e.g., inconsistent predictions, inability to capture fine facial details and
accessories, poor generalization, artifacts). We propose an end-to-end
framework for simultaneously supporting face attribute edits, facial motions
and deformations, and facial identity control for video generation. It employs
a hybrid latent-space that encodes a given frame into a pair of latents: an
Identity latent and a Facial deformation latent, which reside in the W+ and
StyleSpace latent spaces of StyleGAN2, respectively. The framework thereby
incorporates the impressive editability-distortion trade-off of W+ and the
high disentanglement properties of StyleSpace. These hybrid latents drive the
StyleGAN2 generator to achieve high-fidelity face video
re-enactment. Furthermore, the model supports the generation of
realistic re-enactment videos with other latent-based semantic edits (e.g.,
beard, age, make-up, etc.). Qualitative and quantitative analyses performed
against state-of-the-art methods demonstrate the superiority of the proposed
approach. Comment: The project page is located at
https://trevineoorloff.github.io/FaceVideoReenactment_HybridLatents.io
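The hybrid-latent composition described above (identity plus per-frame deformation, optionally combined with a semantic edit such as "beard") can be sketched as simple latent arithmetic. All shapes, names, and the additive combination rule are illustrative assumptions; the real StyleGAN2 generator is not invoked here.

```python
import numpy as np

LATENT_LAYERS, LATENT_DIM = 18, 512   # typical W+ dimensions (assumption)

def compose(identity_wplus, deform_offset, edit_direction=None, alpha=0.0):
    """Combine identity, per-frame deformation, and an optional semantic edit."""
    lat = identity_wplus + deform_offset
    if edit_direction is not None:
        lat = lat + alpha * edit_direction
    return lat

rng = np.random.default_rng(1)
w_id = rng.standard_normal((LATENT_LAYERS, LATENT_DIM))    # identity latent
deform = 0.1 * rng.standard_normal((LATENT_LAYERS, LATENT_DIM))
beard = rng.standard_normal((LATENT_LAYERS, LATENT_DIM))   # edit direction

neutral = compose(w_id, deform)                 # plain re-enactment frame
edited = compose(w_id, deform, beard, alpha=1.5)  # re-enactment + edit
print(neutral.shape)
```

Because the edit is a separate additive term, the same re-enactment sequence can be re-rendered with any latent-based semantic edit without re-encoding the video.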
COVID-VTS: Fact Extraction and Verification on Short Video Platforms
We introduce a new benchmark, COVID-VTS, for fact-checking multi-modal
information involving short-duration videos with COVID-19-focused information
from both the real world and machine generation. We propose TwtrDetective, an
effective model incorporating cross-media consistency checking to detect
token-level malicious tampering in different modalities, and generate
explanations. Due to the scarcity of training data, we also develop an
efficient and scalable approach to automatically generate misleading video
posts by event manipulation or adversarial matching. We investigate several
state-of-the-art models and demonstrate the superiority of TwtrDetective. Comment: 11 pages, 5 figures, accepted to EACL202
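The misleading-post generation idea (event manipulation / adversarial matching) can be illustrated as pairing one post's video with another post's caption to produce tampered training negatives. The captions and swap rule below are invented examples, not the paper's actual pipeline.

```python
posts = [
    {"video_id": "a", "caption": "Officials announce vaccine rollout in Ohio."},
    {"video_id": "b", "caption": "Protest disrupts traffic in Denver."},
]

def adversarial_match(post, other):
    """Pair one post's video with another post's caption to form a negative."""
    return {"video_id": post["video_id"],
            "caption": other["caption"],
            "label": "tampered"}

negatives = [adversarial_match(posts[0], posts[1]),
             adversarial_match(posts[1], posts[0])]
print([n["label"] for n in negatives])
```

Such automatically generated negatives sidestep the scarcity of manually labeled tampered posts.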
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Despite the promising progress in multi-modal tasks, current large
multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions
with respect to the associated image and human instructions. This paper
addresses this issue by introducing the first large and diverse visual
instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction.
Our dataset comprises 400k visual instructions generated by GPT4, covering 16
vision-and-language tasks with open-ended instructions and answers. Unlike
existing studies that primarily focus on positive instruction samples, we
design LRV-Instruction to include both positive and negative instructions for
more robust visual instruction tuning. Our negative instructions are designed
at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent
Object Manipulation and (iii) Knowledge Manipulation. To efficiently measure
the hallucination generated by LMMs, we propose GPT4-Assisted Visual
Instruction Evaluation (GAVIE), a stable approach to evaluate visual
instruction tuning like human experts. GAVIE does not require human-annotated
groundtruth answers and can adapt to diverse instruction formats. We conduct
comprehensive experiments to investigate the hallucination of LMMs. Our results
demonstrate existing LMMs exhibit significant hallucinations when presented
with our negative instructions, particularly Existent Object and Knowledge
Manipulation instructions. Moreover, we successfully mitigate hallucination by
finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving
performance on several public datasets compared to state-of-the-art methods.
Additionally, we observed that a balanced ratio of positive and negative
instances in the training data leads to a more robust model. Comment: 40 pages, 32 figures. Under Review
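The three semantic levels of negative instructions named above can be sketched as data-construction rules applied to a positive sample. The example record and the specific manipulations are invented for illustration; LRV-Instruction itself is generated with GPT4.

```python
positive = {
    "image_objects": ["dog", "frisbee"],
    "instruction": "Describe the dog catching the frisbee.",
}

def nonexistent_object(sample):
    # (i) ask about an object absent from the image
    return {**sample, "instruction": "Describe the cat in the image.",
            "label": "negative:nonexistent-object"}

def existent_object(sample):
    # (ii) attach a false attribute to an object that is present
    return {**sample, "instruction": "Describe the purple dog.",
            "label": "negative:existent-object"}

def knowledge(sample):
    # (iii) inject false external knowledge into the instruction
    return {**sample,
            "instruction": "Explain why this frisbee, invented in 1990, is shown.",
            "label": "negative:knowledge"}

negatives = [f(positive) for f in (nonexistent_object, existent_object, knowledge)]
print(len(negatives))
```

A model tuned only on positive instructions tends to answer all three as if the premise were true; mixing such negatives into training is what the abstract reports as the robustness fix.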
Human Emotion Recognition from Motion Using a Radial Basis Function Network Architecture
(Also cross-referenced as CAR-TR-721)
In this paper a radial basis function network architecture is
developed that learns the correlation between facial feature motion
patterns and human emotions. We describe a hierarchical approach which at
the highest level identifies emotions, at the mid level determines motions
of facial features, and at the low level recovers motion directions.
Individual emotion networks were trained to recognize the "smile" and
"surprise" emotions. Each network was trained by viewing a set of
sequences of one emotion for many subjects. The trained neural network was
then tested for retention, extrapolation, and rejection ability. Success
rates were about 88% for retention, 73% for extrapolation, and 79% for
rejection.
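A minimal radial-basis-function classifier in the spirit of this architecture: hidden units respond to distance from stored motion-pattern prototypes, and a linear output layer maps their activations to emotion scores. The data, prototype selection, and dimensions below are synthetic placeholders, not the paper's features.

```python
import numpy as np

def rbf_activations(x, centers, width=1.0):
    """Gaussian response of each hidden unit to input x."""
    d2 = ((x[None, :] - centers) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * width ** 2))

rng = np.random.default_rng(2)
# Two emotions, each a cluster of (toy) facial-feature-motion vectors.
smile = rng.normal(loc=+1.0, scale=0.2, size=(20, 4))
surprise = rng.normal(loc=-1.0, scale=0.2, size=(20, 4))
X = np.vstack([smile, surprise])
Y = np.vstack([np.tile([1, 0], (20, 1)), np.tile([0, 1], (20, 1))])

centers = X[::5]                              # naive prototype selection
H = np.stack([rbf_activations(x, centers) for x in X])
W, *_ = np.linalg.lstsq(H, Y, rcond=None)     # train linear output layer

pred = rbf_activations(smile.mean(axis=0), centers) @ W
print(pred.argmax())  # index 0 corresponds to "smile"
```

Rejection ability corresponds to both output scores staying low for inputs far from every prototype, which the Gaussian units give for free.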
Temporal Multi-scale Models for Flow and Acceleration
A model for computing image flow in image sequences containing a very wide range of instantaneous flows is proposed. This model integrates the spatio-temporal image derivatives from multiple temporal scales to provide both reliable and accurate instantaneous flow estimates. The integration employs robust regression and automatic scale weighting in a generalized brightness constancy framework. In addition to instantaneous flow estimation the model supports recovery of dense estimates of image acceleration and can be readily combined with parameterized flow and acceleration models. A demonstration of performance on image sequences of typical human actions taken with a high frame-rate camera is given.
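The combination of brightness-constancy constraints from several temporal scales with robust weighting can be illustrated with a deliberately simplified scalar version. Real image flow is 2D and dense; this toy only shows the weighted-least-squares structure, and the gradient values, robust weight, and signals are invented.

```python
def flow_estimate(I_x, I_t_scales, dt_scales, sigma=1.0):
    """Robustly fuse per-scale flow estimates from I_x * u * dt + I_t = 0."""
    num, den = 0.0, 0.0
    for I_t, dt in zip(I_t_scales, dt_scales):
        u_i = -I_t / (I_x * dt)                   # per-scale flow estimate
        w = 1.0 / (1.0 + (u_i / sigma) ** 2)      # robust (Cauchy-like) weight
        num += w * u_i
        den += w
    return num / den

# Spatial gradient 2.0; temporal differences at dt=1 and dt=2 both consistent
# with u = 0.5, plus one grossly inconsistent (outlier) scale.
u = flow_estimate(I_x=2.0, I_t_scales=[-1.0, -2.0, -20.0], dt_scales=[1, 2, 1])
print(round(u, 3))
```

The robust weight downweights the outlier scale, so the fused estimate stays close to 0.5, which is the role scale weighting plays for very fast or very slow motions in the full model.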
Parameterized Modeling and Recognition of Activities
In this paper we consider a class of human activities, atomic activities, which can be represented as a set of measurements over a finite temporal window (e.g., the motion of human body parts during a walking cycle) and which has a relatively small space of variations in performance. A new approach for modeling and recognition of atomic activities that employs principal component analysis and analytical global transformations is proposed. The modeling of sets of exemplar instances of activities that are similar in duration and involve similar body part motions is achieved by parameterizing their representation using principal component analysis. The recognition of variants of modeled activities is achieved by searching the space of admissible parameterized transformations that these activities can undergo. This formulation iteratively refines the recognition of the class to which the observed activity belongs and the transformation parameters that relate it to the model in its class.
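The PCA modeling-and-recognition step can be sketched as fitting a low-dimensional basis per activity class and classifying a new instance by reconstruction error. The synthetic "walk"/"run" data, dimensions, and error criterion are stand-ins; the paper additionally searches over analytical global transformations, which this sketch omits.

```python
import numpy as np

def fit_pca(exemplars, k=2):
    """Class model: mean plus a k-dimensional principal basis."""
    mean = exemplars.mean(axis=0)
    _, _, Vt = np.linalg.svd(exemplars - mean, full_matrices=False)
    return mean, Vt[:k]

def reconstruction_error(x, model):
    mean, basis = model
    coeffs = (x - mean) @ basis.T
    return np.linalg.norm(x - (mean + coeffs @ basis))

rng = np.random.default_rng(3)
# Toy exemplars: each row is one activity window of 10 measurements.
walk = rng.normal(0.0, 1.0, (30, 10)) * [3, 2, 1, .1, .1, .1, .1, .1, .1, .1]
run = rng.normal(5.0, 1.0, (30, 10))

models = {"walk": fit_pca(walk), "run": fit_pca(run)}
query = walk[0]
best = min(models, key=lambda c: reconstruction_error(query, models[c]))
print(best)
```

Extending the search over transformation parameters (e.g., temporal scaling of the window) before computing the error is what lets the full method recognize variants in performance.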
Tracking Rigid Motion using a Compact-Structure Constraint
An approach for tracking the motion of a rigid object using parameterized flow models and a compact-structure constraint is proposed. While polynomial parameterized flow models have been shown to be effective in tracking the rigid motion of planar objects, these models are inappropriate for tracking moving objects that change appearance revealing their 3D structure. We extend these models by adding a structure-compactness constraint that accounts for image motion that deviates from a planar structure. The constraint is based on the assumption that object structure variations are limited with respect to planar object projection onto the image plane and therefore can be expressed as a direct constraint on the image motion. The performance of the algorithm is demonstrated on several long image sequences of rigidly moving objects.
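A polynomial (affine) parameterized flow model of the kind this paper extends describes the motion of every pixel in a region with six parameters. The sketch below shows only that base model with illustrative values; the compact-structure constraint itself is not reproduced.

```python
import numpy as np

def affine_flow(params, x, y):
    """Affine flow: u = a0 + a1*x + a2*y, v = a3 + a4*x + a5*y."""
    a0, a1, a2, a3, a4, a5 = params
    return a0 + a1 * x + a2 * y, a3 + a4 * x + a5 * y

# Pure translation: (u, v) = (1.5, -0.5) at every pixel of a 4x4 patch.
xs, ys = np.meshgrid(np.arange(4), np.arange(4))
u, v = affine_flow([1.5, 0, 0, -0.5, 0, 0], xs, ys)
print(float(u.mean()), float(v.mean()))
```

Non-planar 3D structure produces image motion that no single set of six parameters can explain, which is the residual the structure-compactness constraint is designed to bound.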